Bikeshare systems are an excellent, low-carbon approach to moving
people from point A to B in cities. However, they have unique challenges
related to fluctuations in supply and demand of bikes over time and
across space. At the core of any successful bikeshare program is a
balanced inventory, where all docks have bikes to ride and spots to dock
available. Anticipating user demand in these systems allows
administrators to move bikes ahead of that demand.
In order to
effectively balance the system, systems administrators should focus on
moving bikes on either side of the the rush hour peaks and bringing
bikes from docks at fringe stations to core stations. One way to do this
would be to ‘gamify’ balancing for riders through incentivizing riders
to bring bikes back to their origin stations at the end of the day
and/or incentivizing users to move bikes from full stations to empty
stations during off-peak hours to reduce need of trucks bringing bikes
to and fro. To manage the remainder of the re-balancing need, bikeshare
administrators should direct a small crew running a fleet of trucks to
collect bikes and rebalance them according to projected future demand.
Predicting demand two weeks into the future would balance the competing
needs of accuracy and proactive scheduling of employees.
dat2 <- dat %>%
mutate(interval60 = floor_date(start_time, unit = "hour"),
interval15 = floor_date(start_time, unit = "15 mins"),
week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
filter(end_lon != "", end_lat != "",
start_lon != "", start_lat != "")
phillyCensus <-
get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08012_001", "B08301_001",
"B08301_010", "B01002_001"),
year = 2021,
state = "PA",
geometry = TRUE,
county=c("Philadelphia"),
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Med_Age = B01002_001E,
White_Pop = B02001_002E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
dplyr::select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,
GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
phillyTracts <-
phillyCensus %>%
as.data.frame() %>%
distinct(GEOID, .keep_all = TRUE) %>%
dplyr::select(GEOID, geometry) %>%
st_sf
dat_census <- st_join(dat2 %>%
filter(is.na(start_lon) == FALSE &
is.na(start_lat) == FALSE &
is.na(end_lat) == FALSE &
is.na(end_lon) == FALSE) %>%
st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
phillyTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Origin.Tract = GEOID) %>%
mutate(from_longitude = unlist(map(geometry, 1)),
from_latitude = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
dplyr::select(-geometry)%>%
st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
st_join(., phillyTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Destination.Tract = GEOID) %>%
mutate(end_lon = unlist(map(geometry, 1)),
end_lat = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
dplyr::select(-geometry)
weather.Panel <-
riem_measures(station = "PHL", date_start = "2023-05-01", date_end = "2023-06-04") %>%
dplyr::select(valid, tmpf, p01i, sknt)%>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
grid.arrange(
ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() +
labs(title="Percipitation", x="Hour", y="Precipitation") + plotTheme,
ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() +
labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() +
labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
top="Weather Data - Philadelphia PHL - May-June, 2023")
Here is the weather data, showing how temperature and wind speed
are somewhat tied together and fluctuate with the day-night swing.
ggplot(dat_census %>%
group_by(interval60) %>%
tally())+
geom_line(aes(x = interval60, y = n))+
labs(title="Bike share trips per hr. Philadelphia, May-June, 2023",
x="Date",
y="Number of trips")+
plotTheme
This plot shows a relatively consistent daily fluctuation, with two
peaks on each day of the work week - one in the morning and a larger one
in the evening rush hour.
Also visible here is the relatively lower ridership on weekends. Particularly notable on May 13 and 20.
dat_census %>%
mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
group_by(interval60, start_station, time_of_day) %>%
tally()%>%
group_by(start_station, time_of_day)%>%
summarize(mean_trips = mean(n))%>%
ggplot()+
geom_histogram(aes(mean_trips), binwidth = 1)+
labs(title="Mean Number of Hourly Trips Per Station. Philadelphia, May, 2023",
x="Number of trips",
y="Frequency")+
facet_wrap(~time_of_day)+
plotTheme
Mid-day and PM rush have more normal distributions, whereas AM rush
and Overnight are skewed towards mostly one or two trips.
ggplot(dat_census %>%
group_by(interval60, start_station) %>%
tally())+
geom_histogram(aes(n), binwidth = 1)+
labs(title="Bike share trips per hr by station. Philadelphia, May, 2023",
x="Trip Counts",
y="Number of Stations")+
plotTheme
Overall, the most frequent count is a single trip.
dat_census <- dat_census %>%
mutate(start_time = as.POSIXct(start_time, format = "%m/%d/%Y %H:%M"))
# ggplot(dat_census %>% mutate(hour = hour(start_time)))+
# geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
# labs(title="Bike share trips in Philadelphia, by day of the week, May, 2018",
# x="Hour",
# y="Trip Counts")+
# plotTheme
ggplot(dat_census %>%
mutate(hour = hour(start_time),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
labs(title="Bike share trips in Philadelphia - weekend vs weekday, May, 2018",
x="Hour",
y="Trip Counts")+
plotTheme
From this graph we can see one peak per day on weekends versus two
on weekdays. On average, ridership is lower on weekends. One key
takeaway from this graph is the greater need for balancing the supply of
bikes on weekdays than on weekends.
ggplot()+
geom_sf(data = phillyTracts %>%
st_transform(crs=4326))+
geom_point(data = dat_census %>%
mutate(hour = hour(start_time),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
group_by(start_station, from_latitude, from_longitude, weekend, time_of_day) %>%
tally(),
aes(x=from_longitude, y = from_latitude, color = n),
fill = "transparent", alpha = 0.4, size = 0.3)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
facet_grid(weekend ~ time_of_day)+
labs(title="Bike share trips per hr by station. Philadelphia, May, 2024")+
mapTheme
This multi-plot shows the need to balance stations by bringing
bikes from stations at the fringe of the network to stations closer to
the core, due to the fact that core stations will have more churn.
This focus on trips by stations reflects features not otherwise captured in the model, such as amenities near each station. Given that amenities are fixed in both time and space, inclusion of these amenities would result in mutli-colinearity. Therefore, our models capture the effect of these features using station locations themselves as a proxy rather than including the amenities in the model explicitly.
# length(unique(dat_census$interval60)) * length(unique(dat_census$start_station))
study.panel <-
expand.grid(interval60=unique(dat_census$interval60),
start_station = unique(dat_census$start_station)) %>%
left_join(., dat_census %>%
dplyr::select(start_station, start_station, Origin.Tract, from_longitude, from_latitude )%>%
distinct() %>%
group_by(start_station) %>%
slice(1))
# nrow(study.panel)
ride.panel <-
dat_census %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel) %>%
group_by(interval60, start_station, Origin.Tract, from_longitude, from_latitude) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(weather.Panel) %>%
ungroup() %>%
filter(is.na(start_station) == FALSE) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE)) %>%
filter(is.na(Origin.Tract) == FALSE)
ride.panel <-
left_join(ride.panel, phillyCensus %>%
as.data.frame() %>%
select(-geometry), by = c("Origin.Tract" = "GEOID"))
ride.panel <-
ride.panel %>%
arrange(start_station, interval60) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24),
holiday = ifelse(yday(interval60) == 148,1,0)) %>%
mutate(day = yday(interval60)) %>%
mutate(hour = hour(interval60)) %>%
mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
dplyr::lag(holiday, 2) == 1 ~ "PlustTwoDays",
dplyr::lag(holiday, 3) == 1 ~ "PlustThreeDays",
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag))
as.data.frame(ride.panel) %>%
group_by(interval60) %>%
summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Trip_Count) %>%
mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
"lag12Hours","lag1day")))%>%
group_by(Variable) %>%
summarize(correlation = round(cor(Value, Trip_Count),2))
ride.Train <- filter(ride.panel, week >= 20)
ride.Test <- filter(ride.panel, week < 20)
ride.Train$dotw <- factor(ride.Train$dotw)
ride.Train$hour <- as.numeric(ride.Train$hour)
reg1 <-
lm(Trip_Count ~ hour + dotw + Temperature, data=ride.Train)
reg2 <-
lm(Trip_Count ~ start_station + dotw + Temperature, data=ride.Train)
reg3 <-
lm(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + Precipitation,
data=ride.Train)
reg4 <-
lm(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day,
data=ride.Train)
reg5 <-
lm(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday,
data=ride.Train)
ride.Test.weekNest <-
ride.Test %>%
nest(-week)
model_pred <- function(dat, fit){
pred <- predict(fit, newdata = dat)}
week_predictions <-
ride.Test.weekNest %>%
mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred)) %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Trip_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
week_predictions %>%
dplyr::select(week, Regression, MAE) %>%
gather(Variable, MAE, -Regression, -week) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = palette5) +
labs(title = "Mean Absolute Errors by model specification and week") +
plotTheme
This plot shows how model accuracy improves (MAE decreases) with
each feature added to the model. Interesting to note here is that the
model does not appear to improve when we add temporal proximity to
holidays. This may be due to the fact that there aren’t many major
holidays in the window of time that we used to train the model.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station)) %>%
dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -start_station) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size = 1.1) +
facet_wrap(~Regression, ncol=1) +
labs(title = "Predicted/Observed bike share time series", subtitle = "Philadelphia; A test set of 2 weeks", x = "Hour", y= "Station Trips") +
plotTheme
This plot shows how each successive feature we added to our model
improved how tightly our predictions track with the observed
condition.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude)) %>%
select(interval60, start_station, from_longitude, from_latitude, Observed, Prediction, Regression) %>%
unnest() %>%
filter(Regression == "ETime_Space_FE_timeLags_holidayLags") %>%
group_by(start_station, from_longitude, from_latitude) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = from_longitude, y = from_latitude, color = MAE),
fill = "transparent", alpha = 0.4)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
labs(title="Mean Abs Error, Test Set, Model 5")+
mapTheme
This plot shows higher errors at the core of the network than the
fringe. This is occurring in an area of higher density and activity, and
is likely the product of these factors.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station = map(data, pull, start_station),
from_latitude = map(data, pull, from_latitude),
from_longitude = map(data, pull, from_longitude),
dotw = map(data, pull, dotw) ) %>%
select(interval60, start_station, from_longitude,
from_latitude, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
group_by(start_station, weekend, time_of_day, from_longitude, from_latitude) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = from_longitude, y = from_latitude, color = MAE),
fill = "transparent", size = 0.5, alpha = 0.4)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(dat_census$from_latitude), max(dat_census$from_latitude))+
xlim(min(dat_census$from_longitude), max(dat_census$from_longitude))+
facet_grid(weekend~time_of_day)+
labs(title="Mean Absolute Errors, Test Set")+
mapTheme
When we break out the errors by time of day on weekday and weekend,
we can similar patterns in distribution to the the overall map of MAE.
On the weekends, the higher errors tended to occur further outside of
center city, likely due to greater ridership along the Schuylkill River
Trail.
control <- trainControl(method="cv", number=20)
# Fit the model
set.seed(123) # for reproducibility
model_cv <- train(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday,
data=ride.panel,
method="lm",
trControl=control,
na.action=na.pass)
# View the results
print(model_cv)
## Linear Regression
##
## 178704 samples
## 12 predictor
##
## No pre-processing
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 169768, 169769, 169769, 169769, 169769, 169769, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.040202 0.3320501 0.6456757
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
mae_values <- as.data.frame(model_cv$resample$MAE)
names(mae_values) <- "MAE"
mean_mae <- mean(mae_values$MAE)
std_mae <- sd(mae_values$MAE)
ggplot(data = mae_values, aes(x = MAE)) +
geom_histogram(binwidth = 0.001) +
geom_vline(xintercept = mean_mae, color = "red", linetype = "dashed", size = 1) +
geom_vline(xintercept = mean_mae + std_mae, color = "green", linetype = "dashed", size = 1) +
geom_vline(xintercept = mean_mae - std_mae, color = "green", linetype = "dashed", size = 1)
Here is a histogram of mean absolute error, with vertical lines for
mean (red) and standard deviation (green). A tighter histogram here
would represent greater generalizability of the model. The spread of
this histogram indicates decent generalizability, but it could be
better.
Our model is not great a capturing peak times, and it isn’t
particularly generalizable. However, it does capture the spatial and
temporal dimensions of bikeshare demand, which are a good starting point
for bike balancing, assuming that the model continues to be improved
over time. Some ways that the model could be improved are through
engineering of temporally-variable amenities in the form of businesses,
museums, and restaurants encoded spatially but also with the hours they
are open and closed.
Our data also looks at only May to June.
Given the high number of higher education institutions in Philly, it is
important to consider how the dip in the city’s population will be
reflected in ridership, particularly in areas of the city with higher
proportions of student housing and university buildings. Engineering
features to reflect this fluctuation would likely improve model
accuracy. It is also important to consider how bikeshare demand changes
across the seasons. Therefore, before this model is used to inform bike
re-balancing beyond May-June, it should be trained on data gathered
across an entire year.